TeYang Lau
Created: 4/11/2020
Last update: 9/11/2020

Exploratory Data Analysis
3.1. By Flat Type
3.2. By Town
3.3. By Storeys
3.4. By Floor Area
3.5. By Block Number
3.6. By Flat Model
3.7. By Lease Commence Date
3.8. By Distance to Nearest Amenities
3.9. By Number of Amenities in 2km Radius
Data Preparation
4.1. Missing Values
4.2. Multicollinearity
4.3. Normality]
4.4. Label & Dummy Encoding
4.5. Feature Scaling
Linear Regression
5.1. Model Building & Results
5.2. Homoscedasticity and Normality of Residuals
5.3. Feature Importance
Random Forest
6.1. Out-Of-Bag
6.2. K-fold Cross Validation
6.3. Feature Importance
6.4. SHAP Values
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
price1999 = pd.read_csv('Data/resale-flat-prices-based-on-approval-date-1990-1999.csv')
price2012 = pd.read_csv('Data/resale-flat-prices-based-on-approval-date-2000-feb-2012.csv')
price2014 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv')
price2016 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv')
price2017 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv')
cpi = pd.read_csv('Data/CPI.csv')
price1999.head()
price2012.head()
price2014.head()
price2016.head()
# Merge dfs
prices = pd.concat([price1999, price2012, price2014], sort=False)
prices = pd.concat([prices, price2016, price2017], axis=0, ignore_index=True, sort=False)
prices['month'] = pd.to_datetime(prices['month']) # to datetime
prices.info()
prices[~prices.isnull().any(axis=1)]['month'].dt.year.unique()
remaining_lease has lots of NAs. They are only available after 2015 sales onwards.
# Clean flat type
prices['flat_type'] = prices['flat_type'].str.replace('MULTI-GENERATION', 'MULTI GENERATION')
prices['flat_type'].unique()
# Rename flat model duplicates
replace_values = {'NEW GENERATION':'New Generation', 'SIMPLIFIED':'Simplified', 'STANDARD':'Standard', 'MODEL A-MAISONETTE':'Maisonette', 'MULTI GENERATION':'Multi Generation', 'IMPROVED-MAISONETTE':'Executive Maisonette', 'Improved-Maisonette':'Executive Maisonette', 'Premium Maisonette':'Executive Maisonette', '2-ROOM':'2-room', 'MODEL A':'Model A', 'MAISONETTE':'Maisonette', 'Model A-Maisonette':'Maisonette', 'IMPROVED':'Improved', 'TERRACE':'Terrace', 'PREMIUM APARTMENT':'Premium Apartment', 'Premium Apartment Loft':'Premium Apartment', 'APARTMENT':'Apartment', 'Type S1':'Type S1S2', 'Type S2':'Type S1S2'}
prices = prices.replace({'flat_model': replace_values})
prices['flat_model'].value_counts()
Types of Flat Models:
Standard: (1/2/3/4/5-room). 1960s HDB. Had WC and shower in same room. 5-room Standard were introduced in 1974.
Improved: (1/2/3/4/5-room). Introduced in 1966, the 3/4-room having separate WC and shower, they also featured void decks. 5-room Improved were introduced in 1974.
New Generation: Started first in 1975, New Generation flats can be 3-Room (67 / 82 sqm) or 4-Room (92 sqm), featuring en-suite toilet for master bedroom, with pedestal type Water Closet, plus store room.
Model A: Introduced in 1981: 3-Room (75 sqm), 4-Room (105 sqm), 5-Room (135 sqm), 5-Room Maisonette (139 sqm)
Model A2: Smaller units of Model A. e.g., 4-Room Model A2 (90 sqm)
Simplified: Introduced in 1984: 3-Room (64 sqm), 4-Room (84 sqm)
Multi Generation: 3Gen flats designed to meet the needs of multi-generation families.
Maisonette: AKA Model A Maisonette — 2 storeys HDB flat
Premium Apartment: Introduced somewhere during 1990s, featuring better quality finishes, you get them in ready-to-move condition, with flooring, kitchen cabinets, built-in wardrobes
Executive Maisonette: More premium version of Model A Maisonettes. These units are no longer being built after being replaced by the Executive Condominium (EC) scheme in 1995
Executive Apartment: Executive Apartment / Maisonette (146-150 sqm) were introduced in 1983 and replaced 5-Room Model A flats, in addition of the 3-bedroom and separate living/dining found in 5A flats, EA and EM feature an utility/maid room. 80% of Executive units were Maisonettes and 20% were Apartments.
DBBS: public apartments built under the HDB's short-lived Design, Build and Sell Scheme (DBSS) from 2005 to 2012. They are a unique (and premium) breed of HDB flats in Singapore, which are built by private developers. High Prices. Quite similiar to Executive Condominium except DBBS is like a premium HDB without facilities of private condos
Adjoined Flat: Large HDB flats which are combined from 2 HDB flats
Terrace: HDB terrace flats built before HDB, without realizing Singapore's land constraint. Discontinued
Type S1S2: apartments at The Pinnacle@Duxton are classified as "S" or Special apartments in view of its historical significance and award-winning design. For application of HDB policies, S1 and S2 apartments will be treated as 4-room and 5-room flats respectively
2-room: Most likely refers to 2-room flexi where there is 1 bedroom and 1 common area
prices['storey_range'].unique()
prices['town'].unique()
plt.hist(prices['floor_area_sqm'], bins=50, edgecolor='black')
plt.title('Distribution of HDB Floor Area')
plt.show()
display(prices[prices['floor_area_sqm'] > 200]['flat_model'].value_counts())
The floor area outliers mostly belong to special HDBs that are larger than the standard ones. So they might not be outliers in a multivariate sense.
bins = prices['lease_commence_date'].max() - prices['lease_commence_date'].min()
plt.hist(prices['lease_commence_date'], bins=bins, edgecolor='black')
plt.title('Distribution of Lease Commence Year')
plt.show()
# Compute Resale Price Adjusted for Inflation Using Consumer Price Index for Housing & Utilities
# https://www.singstat.gov.sg/find-data/search-by-theme/economy/prices-and-price-indices/latest-data
cpi['month'] = pd.to_datetime(cpi['month'], format='%Y %b') # to datetime
prices = prices.merge(cpi, on='month', how='left')
# https://people.duke.edu/~rnau/411infla.htm
prices['real_price'] = (prices['resale_price'] / prices['cpi']) * 100
# Plot Median Resale Prices Over the Years
# Unadjusted
fig = plt.figure(figsize=(14,4.5))
fig.suptitle('Median HDB Resale Prices Over the Years', fontsize=18)
ax1 = fig.add_subplot(121)
prices.groupby('month')[['resale_price']].median().plot(ax=ax1, legend=None)
ax1.set_xlabel('Date'), ax1.set_ylabel('Resale Price in SGD ($)'), ax1.set_ylim(0, 500000), ax1.set_title('Unadjusted for Inflation', size=15)
# Adjusted
# https://jakevdp.github.io/PythonDataScienceHandbook/04.09-text-and-annotation.html
ax2 = fig.add_subplot(122)
prices.groupby('month')[['real_price']].median().plot(ax=ax2, legend=None)
ax2.set_xlabel('Date'), ax2.set_ylabel('Resale Price in SGD ($)'), ax2.set_ylim(0, 500000), ax2.set_title('Adjusted for Inflation to 2019 SGD',size=15)
ax2.annotate('1997 Asian Financial Crisis\nMedian: $403,766', xy=('1997-05-01',380000), xycoords='data',
bbox=dict(boxstyle="round4,pad=.5", fc="none", ec="gray"), xytext=(50,-140), textcoords='offset points', ha='center',
arrowprops=dict(arrowstyle="->", connectionstyle="angle,angleA=0,angleB=90,rad=20"))
ax2.annotate('2013 Cooling Measures\nMedian: $401,887', xy=('2013-07-01',380000), xycoords='data',
bbox=dict(boxstyle="round4,pad=.5", fc="none", ec="gray"), xytext=(0,-90), textcoords='offset points', ha='center',
arrowprops=dict(arrowstyle="->", connectionstyle="angle,angleA=0,angleB=90,rad=20"))
plt.tight_layout(rect=[0, 0, 0.9, 0.9])
plt.show()
#prices.set_index('month').loc['1997']['real_price'].median()
Following the collapse of the thai Baht in July 1997, housing prices in Singapore continue to fall and only started gradually increasing again around 2004. In 2013, it experienced a decline due to 'Propery Market Cooling Measures', such as the Additional Buyer's Stamp Duty (ABSD), Loan-to-Value (LTV) Ratio, and Total Debt Servicing Ratio (TDSR). Refer here for more information.
# Convert remaining_lease to number of years
def getYears(text):
if isinstance(text, str):
yearmonth = [int(s) for s in text.split() if s.isdigit()]
if len(yearmonth) > 1: # if there's year and month
years = yearmonth[0] + (yearmonth[1]/12)
else: # if only year
years = yearmonth[0]
return years
else: # if int
return text
prices['remaining_lease'] = prices['remaining_lease'].apply(lambda x: getYears(x))
prices.info()
bins = prices['remaining_lease'].max() - prices['remaining_lease'].min()
plt.hist(prices['remaining_lease'], bins=int(bins), edgecolor='black')
plt.title('Distribution of Remaining Lease for 2016-2020 Data')
plt.show()
prices.head()
## Waffle chart for flat type - number of rooms
from pywaffle import Waffle
flattype = dict(prices['flat_type'].value_counts()/len(prices)*100)
flattype1519 = dict(prices.set_index('month')['2015':'2019'].reset_index()['flat_type'].value_counts()/len(prices.set_index('month')['2015':'2019'].reset_index())*100)
plt.figure(figsize=(10,5),
FigureClass=Waffle,
plots={
'211': {
'values': flattype,
'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize':11},
'title': {'label': 'Proportion of HDB Flat Types (All Years)', 'loc': 'left', 'fontsize':16}
},
'212': {
'values': flattype1519,
'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize':11},
'title': {'label': 'Proportion of HDB Flat Types (2015-2019)', 'loc': 'left', 'fontsize':16}
},
},
rows=5,
colors=["#1f77b4", "#ff7f0e", "green", 'purple', 'black', 'yellow', 'brown'],
icons='home',
font_size=22,
icon_legend=True)
plt.show()
There are not many 1 Room, 2 Room and Multi Generation flats, so they will be removed for looking at flat types.
flattype = ['3 ROOM','4 ROOM','5 ROOM','EXECUTIVE']
prices1519 = prices.set_index('month').sort_index().loc['2015-01':'2019-12']
prices1519 = prices1519[prices1519['flat_type'].isin(flattype)][['flat_type','real_price']].reset_index()
prices1519['flat_type_year'] = prices1519['flat_type'] + ' - ' + prices1519['month'].apply(lambda x: str(x)[:4])
prices1519
import joypy
fig, axes = joypy.joyplot(prices1519, by="flat_type_year", column="real_price",figsize=(9,7),
linewidth=0.05,overlap=1.5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',4))
axes[-1].set_xlim([0,1200000])
axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k'])
plt.xlabel('Resale Price SGD ($)', fontsize=14)
fig.show()
We don't see much changes within each flat type through the last 5 years. The only consistent pattern is that prices become higher as flats have more rooms, which is unsurprising.
## 1997 vs 2019
# all room type
prices['year'] = pd.DatetimeIndex(prices['month']).year # extract out year
prices9719 = prices[prices['year'].isin([1997,2019])].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices9719['change'] = prices9719.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices9719 = prices9719[prices9719['change'].notnull()]
prices9719 = prices9719.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices9719['color'] = prices9719['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
# 4-room
prices9719_4room = prices[(prices['flat_type'].isin(['4 ROOM']) & prices['year'].isin([1997,2019]))].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices9719_4room['change'] = prices9719_4room.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices9719_4room = prices9719_4room[prices9719_4room.change.notnull()]
prices9719_4room = prices9719_4room.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices9719_4room['color'] = prices9719_4room['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
## 2018 vs 2019
# all room type
prices1819 = prices[prices['year'].isin([2018,2019])].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices1819['change'] = prices1819.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices1819 = prices1819[prices1819['change'].notnull()]
prices1819 = prices1819.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices1819['color'] = prices1819['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
# 4-room
prices1819_4room = prices[(prices['flat_type'].isin(['4 ROOM']) & prices['year'].isin([2018,2019]))].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices1819_4room['change'] = prices1819_4room.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices1819_4room = prices1819_4room[prices1819_4room.change.notnull()]
prices1819_4room = prices1819_4room.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices1819_4room['color'] = prices1819_4room['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
def loll_plot(df, x, y, subtitle, xlabel, xlim):
plt.rc('axes', axisbelow=True)
plt.grid(linestyle='--', alpha=0.4)
plt.hlines(y=df.index, xmin=0, xmax=df[x], color=df.color, linewidth=1)
plt.scatter(df[x], df.index, color=df.color, s=300)
for i, txt in enumerate(df[x]):
plt.annotate(str(round(txt)), (txt, i), color='white', fontsize=9, ha='center', va='center')
plt.annotate(subtitle, xy=(1, 0), xycoords='axes fraction', fontsize=20,
xytext=(-5, 5), textcoords='offset points',
ha='right', va='bottom')
plt.yticks(df.index, df[y]); plt.xticks(fontsize=12); plt.xlim(xlim)
plt.xlabel(xlabel, fontsize=14)
fig = plt.figure(figsize=(12,7))
ax1 = plt.subplot(121)
loll_plot(prices9719, 'change', 'town', 'All Room Types', 'Percent Change (%)', [-40,125])
ax2 = plt.subplot(122)
loll_plot(prices9719_4room, 'change', 'town', '4-Room', 'Percent Change (%)', [-40,125])
fig.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.suptitle('1997 vs 2019, Median Price of Flats', fontsize=16)
plt.show()
Queenstown appears to have one of the highest increases in resale price, which could be due to it being developed over the past 2 decades.
fig = plt.figure(figsize=(12,7))
ax1 = plt.subplot(121)
loll_plot(prices1819, 'change', 'town', 'All Room Types', 'Percent Change (%)', [-30,20])
ax2 = plt.subplot(122)
loll_plot(prices1819_4room, 'change', 'town', '4-Room', 'Percent Change (%)', [-30,20])
fig.tight_layout(pad=0.5, rect=[0, 0, 0.9, 0.9])
plt.suptitle('2018 vs 2019, Median Price of Flats', fontsize=16)
plt.show()
The changes are not very large from 2018 to 2019, although prices for Toa Payoh has dropped and Central Area 4-room flats have also dropped by 20%. Could it be because these areas have older HDB meaning their lease are now shorter? As shown below, it seems that places like Punggol, and Sengkang, tend to have later lease commence date, as they were developed later, which might have led to their slight increase in prices, while places like Toa Payoh and Central Area, tend to have older lease commence date.
prices[prices['year'].isin([2018,2019])].groupby('town')['lease_commence_date'].median().sort_values()
fig = plt.figure(figsize=(12,4))
# Storey Prices
ax1 = plt.subplot(121)
storey = prices.groupby('storey_range')['real_price'].median().reset_index().sort_values(by='storey_range')
storey['storey_rank'] = storey['storey_range'].astype('category').cat.codes # label encode
a=sns.scatterplot(x=storey['storey_rank'], y=storey['real_price'], s=storey['storey_rank'].astype('int')*30, color='#00994d', edgecolors='w', alpha=0.5, ax=ax1)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in a.get_yticks()/1000]
ax1.set_yticklabels(ylabels)
ax1.set_xticklabels(pd.Series(['']).append(storey.iloc[[0,5,10,15,20,24],0]))
ax1.set_ylim([280000,1100000]), ax1.set_ylabel('Resale Price SGD ($)', size=15), ax1.set_xlabel('Storey', size=15)
ax1.set_title('All Years', size=15)
# Floor Area Prices
ax2 = plt.subplot(122)
storey2 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].groupby('storey_range')['real_price'].median().reset_index().sort_values(by='storey_range')
storey2['storey_rank'] = storey2['storey_range'].astype('category').cat.codes
b=sns.scatterplot(x=storey2['storey_rank'], y=storey2['real_price'], s=storey2['storey_rank'].astype('int')*30, color='#00994d', edgecolors='w', alpha=0.5, ax=ax2)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in ax2.get_yticks()/1000]
ax2.set_yticklabels(ylabels); ax2.set_ylabel('')
ax2.set_xticks([0,4,8,12,16])
ax2.set_xticklabels(storey2.iloc[[0,4,8,12,16],0])
ax2.set_ylim([280000,1100000]), ax2.set_xlabel('Storey', size=15)
ax2.set_title('2015 to 2019', size=15)
plt.show()
Surprisingly, this follows a linear relationship, with higher storeys being sold at a higher price.
# Floor Area Prices
area = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
p=sns.regplot(x='floor_area_sqm', y='real_price', data=area, scatter_kws={"s": 1, 'alpha':0.5})
ylabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_yticks()/1000]
p.set_yticklabels(ylabels)
p.set_ylabel('Resale Price SGD ($)', size=15)
p.set_xlabel('Floor Area (Square Meters)', size=15)
plt.close()

display(area[area['floor_area_sqm'] > 200])
Those cases on top right of the chart consists of flats that are either Terrace or Executive Maisonette, which is not surprising.
3 digit system was introduced in the 1970s, with the 1st digit representing a neighbourhood in a town. So for e.g., AMK neighbourhood 1 starts with 101, and AMK neighbourhood 2 starts with 201. So first digit was separated from last 2 digits and plotted separately
import re
# Block Number Prices
get_num = lambda x: int(re.findall("\d+", x)[0])
prices['blocknum'] = prices['block'].apply(get_num) # get only digits from block number
tmp = prices[prices['blocknum'] > 99] # get only blocks that use 3-digit numbering system
tmp = tmp.groupby('blocknum')['real_price'].median().reset_index()
fig = plt.figure(figsize=(12,4))
ax1 = plt.subplot(121)
a=sns.scatterplot(x=tmp['blocknum'].apply(lambda x: int(str(x)[0])), y=tmp['real_price'], color='#ff9933', edgecolors='w', alpha=0.9)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in a.get_yticks()/1000]
ax1.set_yticklabels(ylabels)
ax1.set_ylabel('Resale Price SGD ($)', size=15), ax1.set_xlabel('Neighbourhood Number', size=15)
ax2 = plt.subplot(122)
b=sns.scatterplot(x=tmp['blocknum'].apply(lambda x: int(str(x)[1:])), y=tmp['real_price'], edgecolors='w', alpha=0.9)
ax2.set_yticklabels(ylabels)
ax2.set_ylabel('', size=15)
ax2.set_xlabel('Block Number', size=15)
plt.show()
fig = plt.figure(figsize=(12,7))
p=sns.violinplot(x='flat_model', y='real_price', data=prices, width=1,
order=prices.groupby('flat_model')['real_price'].median().sort_values().reset_index()['flat_model'].tolist())
p.set_xticklabels(p.get_xticklabels(), rotation=30, ha='right'), p.set_xlabel('Flat Models', size=15)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_yticks()/1000]
p.set_yticklabels(ylabels)
p.set_ylabel('Resale Price SGD ($)', size=15)
plt.show()
The special models like the Type S1S2 (The Pinnacle@Duxton) and Terrace tend to fetch higher prices while the older models from the 1900s tend to go lower.
# ridgeline plot
# tmp = prices.set_index('flat_model')
# tmp = tmp.loc[prices.groupby('flat_model')['real_price'].median().sort_values().reset_index()['flat_model'].tolist()].reset_index().groupby("flat_model", sort=False)
# fig, axes = joypy.joyplot(tmp, by="flat_model", column="real_price",figsize=(12,5),
# linewidth=0.05,overlap=1.5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',16))
# axes[-1].set_xlim([-50000,1400000])
# axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k', '1400k'])
# plt.xlabel('Resale Price SGD ($)', fontsize=14)
# fig.show()
fig = plt.figure(figsize=(7,9))
p=sns.boxplot(y='lease_commence_date', x='real_price', data=prices, width=1, orient='h', flierprops = dict(markerfacecolor = 'red', markersize = 0.1, linestyle='none'), linewidth=0.4)
p.set_xlabel('Resale Price SGD ($)', size=15), p.set_ylabel('Lease Commence Year', size=15)
xlabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_xticks()/1000]
p.set_xticklabels(xlabels)
p.set_title('Resale Price By Lease Commence Year', size=15)
plt.show()
tmp = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
fig, axes = joypy.joyplot(tmp, by="lease_commence_date", column="real_price",figsize=(6,10),
linewidth=1,overlap=5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',16))
axes[-1].set_xlim([-50000,1400000])
axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k', '1400k'])
plt.xlabel('Resale Price SGD ($)', fontsize=14)
fig.show()
The names of schools, supermarkets, hawkers, shopping malls, parks and MRTs were downloaded/scraped from Data.gov.sg and Wikipedia and fed through a function that uses OneMap.sg api to get their coordinates (latitude and longitude). These coordinates were then fed through other functions which uses geopy package to get the distance between locations. By doing this, the nearest distance of each amenity from each house can be computed, as well as the number of each amenity within a 2km radius of each flat.
Some of Kallang/Whampoa has no distance due to search error on OneMap.sg.
flat_amenities = pd.read_csv('Data/flat_amenities.csv')
prices1519 = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
prices1519['flat'] = prices['block'] + ' ' + prices['street_name']
prices1519 = prices1519.merge(flat_amenities, on='flat', how='left')
d_region = {'ANG MO KIO':'North East', 'BEDOK':'East', 'BISHAN':'Central', 'BUKIT BATOK':'West', 'BUKIT MERAH':'Central',
'BUKIT PANJANG':'West', 'BUKIT TIMAH':'Central', 'CENTRAL AREA':'Central', 'CHOA CHU KANG':'West',
'CLEMENTI':'West', 'GEYLANG':'Central', 'HOUGANG':'North East', 'JURONG EAST':'West', 'JURONG WEST':'West',
'KALLANG/WHAMPOA':'Central', 'MARINE PARADE':'Central', 'PASIR RIS':'East', 'PUNGGOL':'North East',
'QUEENSTOWN':'Central', 'SEMBAWANG':'North', 'SENGKANG':'North East', 'SERANGOON':'North East', 'TAMPINES':'East',
'TOA PAYOH':'Central', 'WOODLANDS':'North', 'YISHUN':'North'}
prices1519['region'] = prices1519['town'].map(d_region)
colors = {'North East':'Purple', 'East':'Green', 'Central':'Brown', 'West':'Red', 'North':'Orange'}
tmp = prices1519.groupby('town')[['dist_dhoby','school_dist','num_school_2km','hawker_dist','num_hawker_2km','park_dist','num_park_2km','mall_dist','num_mall_2km','mrt_dist','num_mrt_2km','supermarket_dist','num_supermarket_2km','real_price']].median().reset_index()
tmp['region'] = tmp['town'].map(d_region)
fig, ax = plt.subplots(figsize=(8,5))
grouped = tmp.groupby('region')
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x='dist_dhoby', y='real_price', label=key, color=colors[key], s=60)
b, a = np.polyfit(tmp['dist_dhoby'], tmp['real_price'], 1)
ax.plot(tmp['dist_dhoby'], a + b* tmp['dist_dhoby'], '-')
ax.set_xlim([0,20]), ax.set_xlabel('Distance from Dhoby Ghaut MRT (km)', size=15)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000]
ax.set_yticklabels(ylabels), ax.set_ylabel('Resale Price SGD ($)', size=15)
for i, txt in enumerate(tmp['town']):
ax.annotate(txt, (tmp['dist_dhoby'][i]+0.3, tmp['real_price'][i]), size=8, alpha=0.9)
plt.show()
Relationship is negative, with flats that are further away from Dhoby Ghaut MRT (Central), having lower resale prices.
p=sns.pairplot(tmp, x_vars=["school_dist", "hawker_dist", "park_dist", "mall_dist", "mrt_dist", "supermarket_dist"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=40)))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Distance From School (km)', size=10), axes[0,1].set_xlabel('Distance From Hawker (km)', size=10)
axes[0,2].set_xlabel('Distance From Park (km)', size=10), axes[0,3].set_xlabel('Distance From Mall (km)', size=10)
axes[0,4].set_xlabel('Distance From MRT (km)', size=10), axes[0,5].set_xlabel('Distance From Supermarket (km)', size=10)
plt.suptitle('Resale Price (Median of Each Town) VS Distance from Nearest Amenities (Median of Each Town)')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()
Marine Parage is far from MRT station.
p=sns.pairplot(prices1519[prices1519['school_dist']<3], x_vars=["school_dist", "hawker_dist", "park_dist", "mall_dist", "mrt_dist", "supermarket_dist"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=0.5,alpha=0.3), line_kws=dict(color='#ff7f0e'))) # remove outliers (>3km)
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Distance From School (km)', size=10), axes[0,1].set_xlabel('Distance From Hawker (km)', size=10)
axes[0,2].set_xlabel('Distance From Park (km)', size=10), axes[0,3].set_xlabel('Distance From Mall (km)', size=10)
axes[0,4].set_xlabel('Distance From MRT (km)', size=10), axes[0,5].set_xlabel('Distance From Supermarket (km)', size=10)
plt.suptitle('Resale Price VS Distance from Nearest Amenities')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

Surprisingly, the relationship with school is not as strong as expected, while distance from hawker is more pronounced from the median plots.
p=sns.pairplot(tmp, x_vars=["num_school_2km", "num_hawker_2km", "num_park_2km", "num_mall_2km", "num_mrt_2km", "num_supermarket_2km"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=40)))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Number of Schools', size=10), axes[0,1].set_xlabel('Number of Hawkers', size=10)
axes[0,2].set_xlabel('Number of Parks', size=10), axes[0,3].set_xlabel('Number of Malls', size=10)
axes[0,4].set_xlabel('Number of MRTs', size=10), axes[0,5].set_xlabel('Number of Supermarkets', size=10)
plt.suptitle('Resale Price (Median of Each Town) VS Number of Amenities in 2km Radius (Median of Each Town)')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()
p=sns.pairplot(prices1519, x_vars=["num_school_2km", "num_hawker_2km", "num_park_2km", "num_mall_2km", "num_mrt_2km", "num_supermarket_2km"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=0.5,alpha=0.3), line_kws=dict(color='#ff7f0e')))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Number of Schools', size=10), axes[0,1].set_xlabel('Number of Hawkers', size=10)
axes[0,2].set_xlabel('Number of Parks', size=10), axes[0,3].set_xlabel('Number of Malls', size=10)
axes[0,4].set_xlabel('Number of MRTs', size=10), axes[0,5].set_xlabel('Number of Supermarkets', size=10)
plt.suptitle('Resale Price VS Number of Amenities in 2km Radius')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

Punggol has its own LRT and Central Area has lots of stations and malls nearby.
# clear unused variables
del(price1999, price2012, price2014, price2016, price2017, prices1819, prices1819_4room, prices9719, prices9719_4room,
storey, storey2, tmp, xlabels, ylabels, p, grouped, flattype2019, flattype, flat_amenities, cpi, ax, ax1, ax2, area)
import gc
gc.collect()
Replace missing distance values with median of the town. Only Kallang/Whampoa has missing data, so the function below will replace them with the median of the Kallang/Whampoa distance variables.
df = prices1519[['town', 'flat_type', 'storey_range', 'floor_area_sqm', 'flat_model', 'lease_commence_date', 'year', 'school_dist', 'num_school_2km', 'hawker_dist', 'num_hawker_2km', 'park_dist', 'num_park_2km', 'mall_dist', 'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist', 'num_supermarket_2km', 'dist_dhoby', 'region', 'real_price']]
def replace_NA_median(df, columns):
for c in columns:
df[c] = df.groupby("town").transform(lambda x: x.fillna(x.median()))[c]
return df
df = replace_NA_median(df, ['school_dist', 'num_school_2km', 'hawker_dist',
'num_hawker_2km', 'park_dist', 'num_park_2km', 'mall_dist',
'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist',
'num_supermarket_2km', 'dist_dhoby'])
df.info()
# Correlation heatmap
fig = plt.figure(figsize=(10,10))
ax = sns.heatmap(df.select_dtypes(include=['int64','float64']).corr(), annot = True, fmt='.2g',
vmin=-1, vmax=1, center= 0, cmap= 'coolwarm_r', linecolor='black', linewidth=1, annot_kws={"size": 7})
#ax.set_ylim(0 ,5)
plt.xticks(rotation=45, ha='right')
plt.title('Correlation Heatmap')
fig.show()
# Multicollinearity
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
# Calculating VIF
vif = pd.DataFrame()
vif["variables"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['tolerance'] = 1/vif.VIF
vif['meanVIF'] = vif.VIF.mean()
return(vif)
calc_vif(df.drop('real_price',axis=1).select_dtypes(include=['int64','float64']))
calc_vif(df.drop(['real_price','num_supermarket_2km','year','num_school_2km','dist_dhoby'],axis=1).select_dtypes(include=['int64','float64']))
# drop columns
lr_df = df.drop(['num_supermarket_2km','year','num_school_2km','dist_dhoby'], axis=1)
# Plot distribution for each continuous variable
lr_df.hist(bins=50, figsize=(15, 10), grid=False, edgecolor='black')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()
Not all the variables follow a normal distribution, and most of the distances variables have some outliers. For these outliers, its better to use a multivariable approach like mahalanobis distance to see if they are outliers instead of using a univariate approach. If needed, some of these variables can be transformed to reduce the skewness.
from statsmodels.api import qqplot
fig, ((ax1,ax2), (ax3,ax4)) = plt.subplots(2,2,figsize=(5,5))
ax1.hist(lr_df['real_price'], bins=50, edgecolor='black')
qqplot(lr_df['real_price'], line='s', ax=ax2)
ax3.hist(np.log(lr_df['real_price']), bins=50, edgecolor='black')
qqplot(np.log(lr_df['real_price']), line='s', ax=ax4)
plt.suptitle('Real Price Normality Before (Top) & After (Bottom) Logging')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

We will fit the linear regression first and come back to check on the normality of the residuals as well as homoscedasticity.
# Frequency plots for catergorical features
fig = plt.figure(figsize=(30,5))
for count, col in enumerate(df.select_dtypes(include=['category','object']).columns):
fig.add_subplot(1,5,count+1)
df[col].value_counts().plot.barh()
plt.title(col)
plt.yticks(rotation=45)
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
Region might be a better choice instead of town since town has lots of classes and one-hot encoding it might lead to a very sparse matrix. For flat_type, we can remove Multi Generation and 1 Room since there are not many instances of them. storey_range will be label encoded according to their levels while flat_model should be further grouped to reduce the number of classes.
# label encode storeys
df = df.sort_values(by='storey_range')
df['storey_range'] = df['storey_range'].astype('category').cat.codes # label encode
lr_df = lr_df.sort_values(by='storey_range')
lr_df['storey_range'] = lr_df['storey_range'].astype('category').cat.codes # label encode
# remove flat types with very few cases
df = df[~df['flat_type'].isin(['MULTI GENERATION', '1 ROOM'])]
lr_df = lr_df[~lr_df['flat_type'].isin(['MULTI GENERATION', '1 ROOM'])]
# Re-categorize flat model to reduce num classes
replace_values = {'Executive Maisonette':'Maisonette', 'Terrace':'Special', 'Adjoined flat':'Special',
'Type S1S2':'Special', 'DBSS':'Special', 'Model A2':'Model A', 'Premium Apartment':'Apartment', 'Improved':'Standard', 'Simplified':'Model A', '2-room':'Standard'}
df = df.replace({'flat_model': replace_values})
lr_df = lr_df.replace({'flat_model': replace_values})
# Label encode flat type
replace_values = {'2 ROOM':0, '3 ROOM':1, '4 ROOM':2, '5 ROOM':3, 'EXECUTIVE':4}
df = df.replace({'flat_type': replace_values})
lr_df = lr_df.replace({'flat_type': replace_values})
df = df.reset_index(drop=True)
display(df['flat_model'].value_counts())
lr_df = lr_df.reset_index(drop=True)
display(lr_df['flat_model'].value_counts())
display(lr_df.head())
## dummy encoding
df = pd.get_dummies(df, columns=['region'], prefix=['region'], drop_first=True) # central is baseline
df = pd.get_dummies(df, columns=['flat_model'], prefix=['model'])
df= df.drop('model_Standard',axis=1) # remove standard, setting it as the baseline
lr_df = pd.get_dummies(lr_df, columns=['region'], prefix=['region'], drop_first=True) # central is baseline
lr_df = pd.get_dummies(lr_df, columns=['flat_model'], prefix=['model'])
lr_df= lr_df.drop('model_Standard',axis=1) # remove standard, setting it as the baseline
Scaling is only done for linear regression. Tree-based models do not require scaling as it does not affect performance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit to continuous columns and transform
scaled_columns = ['floor_area_sqm','lease_commence_date','school_dist','hawker_dist','num_hawker_2km','park_dist',
'num_park_2km', 'mall_dist', 'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist']
scaler.fit(lr_df[scaled_columns])
scaled_columns = pd.DataFrame(scaler.transform(lr_df[scaled_columns]), index=lr_df.index, columns=scaled_columns)
# separate unscaled features
unscaled_columns = lr_df.drop(scaled_columns, axis=1)
# concatenate scaled and unscaled features
lr_df = pd.concat([scaled_columns,unscaled_columns], axis=1)
display(lr_df.head())
from sklearn.linear_model import LinearRegression
lr_y = lr_df[['real_price']]
lr_X = lr_df.drop(['real_price','town'], axis=1)
lin_reg = LinearRegression()
lin_reg.fit(lr_X, np.log(lr_y))
print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(lr_X, np.log(lr_y))}')
import statsmodels.api as sm
# alternate way using statistical formula, which does not require dummy coding manually
# https://stackoverflow.com/questions/50733014/linear-regression-with-dummy-categorical-variables
# https://stackoverflow.com/questions/34007308/linear-regression-analysis-with-string-categorical-features-variables
X_constant = sm.add_constant(lr_X)
lin_reg = sm.OLS(np.log(lr_y),X_constant).fit()
lin_reg.summary()
ax = sns.scatterplot(x=np.log(lr_y)['real_price'], y=lin_reg.predict(), edgecolors='w', alpha=0.9, s=8)
ax.set_xlabel('Observed')#, ax.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_xticks()/1000])
ax.set_ylabel('Predicted')#, ax.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000])
ax.annotate('Adjusted R\u00b2: ' + str(round(lin_reg.rsquared_adj,2)), xy=(0, 1), xytext=(25, -25),
xycoords='axes fraction', textcoords='offset points', fontsize=12)
plt.close()

# Homoscedasticity and Normality of Residuals
pred = lin_reg.predict()
resids = lin_reg.resid
resids_studentized = lin_reg.get_influence().resid_studentized_internal
fig = plt.figure(figsize=(10,3))
ax1 = plt.subplot(121)
sns.scatterplot(x=pred, y=resids_studentized, edgecolors='w', alpha=0.9, s=8)
ax1.set_xlabel('Predicted Values')
ax1.set_ylabel('Studentized Residuals')
ax2 = plt.subplot(122)
sns.distplot(resids_studentized, norm_hist=True, hist_kws=dict(edgecolor='w'))
ax2.set_xlabel('Studentized Residual')
plt.close()

# Feature Importance
lr_results = pd.read_html(lin_reg.summary().tables[1].as_html(),header=0,index_col=0)[0]
coefs = lr_results[['coef']][1:].reset_index().rename(columns={'index':'feature'})
coefs['feature_importance'] = np.abs(coefs['coef'])
coefs = coefs.sort_values('feature_importance').reset_index(drop=True)
coefs['color'] = coefs['coef'].apply(lambda x: '#66ff8c' if x>0 else '#ff8c66')
coefs.plot.barh(x='feature',y='feature_importance',color=coefs['color'],figsize=(6,7))
colors = {'Positive':'#66ff8c', 'Negative':'#ff8c66'}
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
legend = plt.legend(handles, labels, title='Relationship', fontsize = '15')
plt.setp(legend.get_title(),fontsize='17')
plt.xlabel('Standardized Coefficients', size=15), plt.ylabel('Features', size=15)
plt.ylim([-1,23])
plt.title('Linear Regression Feature Importance', size=15)
plt.show()
It seems that region is the feature that drives resale prices the most. The Central region was used as the baseline against which all other regions are compared to. This means that all the other regions tend to be sold lower as compared to the Central area. Floor area, lease commence date and special flat models are also positive drivers of resale prices while distance to nearest hawker center is a negative driver.
Let's look at whether a non-linear model will show consistent drivers for resale prices.
from sklearn.model_selection import train_test_split
y = df[['real_price']]
X = df.drop(['real_price','town', 'year'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1, shuffle=True, random_state=0)
print('Shape of X_train:', X_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of y_test:', y_test.shape)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from scipy.stats import spearmanr, pearsonr
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
predicted_train = rf.predict(X_train)
print(f'Out-of-bag R\u00b2 score estimate: {rf.oob_score_:>5.3}')
predicted_test = rf.predict(X_test)
oob_test_score = r2_score(y_test['real_price'], predicted_test)
spearman = spearmanr(y_test['real_price'], predicted_test)
pearson = pearsonr(y_test['real_price'], predicted_test)
oob_mae = mean_absolute_error(y_test['real_price'], predicted_test)
print(f'Test data R\u00b2 score: {oob_test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
print(f'Test data Mean Absolute Error: {round(oob_mae)}')
from sklearn.model_selection import GridSearchCV
param_grid = {
'max_features': ['auto'], # max number of features considered for splitting a node
'max_depth': [20], # max number of levels in each decision tree
'min_samples_split': [15], # min number of data points placed in a node before the node is split
'min_samples_leaf': [2]} # min number of data points allowed in a leaf node
rfr =GridSearchCV(RandomForestRegressor(n_estimators = 500, n_jobs=-1, random_state=0),
param_grid, cv=10, scoring='r2', return_train_score=True)
rfr.fit(X_train,y_train)
print("Best parameters set found on Cross Validation:\n\n", rfr.best_params_)
print("\nCross Validation R\u00b2 score:\n\n", rfr.best_score_.round(3))
cv_predicted_test = rfr.predict(X_test)
cv_test_score = r2_score(y_test['real_price'], cv_predicted_test)
spearman = spearmanr(y_test['real_price'], cv_predicted_test)
pearson = pearsonr(y_test['real_price'], cv_predicted_test)
cv_mae = mean_absolute_error(y_test['real_price'], cv_predicted_test)
print(f'Test data R\u00b2 score: {cv_test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
print(f'Test data Mean Absolute Error: {round(cv_mae)}')
fig = plt.figure(figsize=(13,4))
ax1 = plt.subplot(121)
ax1 = sns.scatterplot(x=y_test['real_price'], y=predicted_test, edgecolors='w', alpha=0.9, s=8)
ax1.set_xlabel('Observed'), ax1.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_xticks()/1000])
ax1.set_ylabel('Predicted'), ax1.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_yticks()/1000])
ax1.annotate('Test R\u00b2: ' + str(round(oob_test_score,3)) + '\nTest MAE: ' + str(round(oob_mae)), xy=(0, 1), xytext=(25, -35),
xycoords='axes fraction', textcoords='offset points', fontsize=12)
ax1.set_title('Tuned Using Out-Of-Bag')
ax2 = plt.subplot(122)
ax2 = sns.scatterplot(x=y_test['real_price'], y=cv_predicted_test, edgecolors='w', alpha=0.9, s=8)
ax2.set_xlabel('Observed'), ax2.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_xticks()/1000])
ax2.set_ylabel('Predicted'), ax2.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_yticks()/1000])
ax2.annotate('Test R\u00b2: ' + str(round(cv_test_score,3)) + '\nTest MAE: ' + str(round(cv_mae)), xy=(0, 1), xytext=(25, -35),
xycoords='axes fraction', textcoords='offset points', fontsize=12)
ax2.set_title('Tuned Using Cross Validation')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

fig = plt.figure(figsize=(14,7))
ax1 = plt.subplot(121)
feat_imp = pd.DataFrame({'Features': X_train.columns, 'Feature Importance': rf.feature_importances_}).sort_values('Feature Importance', ascending=False)
sns.barplot(y='Features', x='Feature Importance', data=feat_imp)
#plt.xticks(rotation=45, ha='right')
ax1.set_title('OOB Feature Importance', size=15)
ax2 = plt.subplot(122)
feat_imp = pd.DataFrame({'Features': X_train.columns, 'Feature Importance': rfr.best_estimator_.feature_importances_}).sort_values('Feature Importance', ascending=False)
sns.barplot(y='Features', x='Feature Importance', data=feat_imp)
ax2.set_title('CV Feature Importance', size=15)
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
fig.show()
The feature importance are abit different from the linear regression model. Floor area and lease commence date are still one of the main drivers of resale prices. However, features like distance from Dhoby Ghaut MRT and flat type also appears to be good drivers.
Tree-based models seem to give lower importance to categorical values. This is due to the importance score being a measure of how often the feature was selected for splitting and how much gain in purity was achieved as a result of the selection.
We can also look at feature importance using SHAP (SHapley Additive exPlanations) values first proposed by Lundberg and Lee (2006) for model interpretability of any machine learning model. SHAP values have a few advantages:
Below shows the contributors to resale price for an example of low, middle and high resale price flats.
import shap
shap.initjs()
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[16]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[16]])
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[5]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[5]])
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[1]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[1]])
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[100]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[100]])
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[172]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[172]])
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[10084]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[10084]])
print("Flat Type Encoding = 2 ROOM:0, 3 ROOM:1, 4 ROOM:2, 5 ROOM:3, EXECUTIVE:4")